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Abstract 

o 

^N I This paper treats the problem of minimizing a general continuously differentiable 

function subject to sparsity constraints. We present and analyze several different op- 
timality criteria which are based on the notions of stationarity and coordinate-wise 
c/j ' optimality. These conditions are then used to derive three numerical algorithms aimed 

at finding points satisfying the resulting optimality criteria: the iterative hard thresh- 
olding method and the greedy and partial sparse-simplex methods. The first algorithm 
^ , is essentially a gradient projection method while the remaining two algorithms are of 

^^ I coordinate descent type. The theoretical convergence of these methods and their rela- 

l/^ ' tions to the derived optimality conditions are studied. The algorithms and results are 

''^ ' illustrated by several numerical examples. 

rn 

O 

1 Introduction 

'kJi • Sparsity has long been exploited in signal processing, applied mathematics, statistics and 

H ■ computer science for tasks such as compression, denoising, model selection, image processing 

c^ I 

- ■ ■ and more [TJ [H [HI [T71 [191 ^H [22] . Recent years have witnessed a growing interest in sparsity- 

based processing methods and algorithms for sparse recovery [21 [H [2l]. Despite the great 

interest in exploiting sparsity in various applications, most of the work to date has focused 

on recovering sparse data represented by a vector x G M" from linear measurements of 

the form b = Ax. For example, the rapidly growing field of compressed sensing |9l |6l |T6] 

considers recovery of a sparse x from a small set of linear measurements b G M™ where m is 

usually much smaller than n. Since in practice the measurements are contaminated by noise, 

a typical approach to recover x is to seek a sparse vector x that minimizes the quadratic 

function || Ax — bl^. 
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In this paper we study the more general problem of minimizing a continuously differentiable 
objective function subject to a sparsity constraint. More specifically, we consider the problem 

min /(x) 

s.t. ||x||o < s, 

where / : M" — )■ M is a continuously differentiable function, s > is an integer smaller than 
n and ||x||o is the io norm of x, which counts the number of nonzero components in x. We 
do not assume that / is a convex function. This, together with the fact that the constraint 
function is nonconvex, and is in fact not even continuous, renders the problem quite difficult. 
Our goal in this paper is to study necessary optimality conditions for problem (P) and to 
develop algorithms that find points satisying these conditions for general choices of /. 

Two instances of problem (P) that have been considered in previous literature and will 
serve as prototype models throughout the paper are described in the following two examples. 

Example 1.1 (Compressive Sensing). As mentioned above, compressed sensing is concerned 
with recovery of a sparse vector x from linear measurements Ax = b, where A G R™^", b G 
M™ and m is usually much smaller than n. It is well known that under suitable conditions on 
A, only the order of s logn measurements are needed to recover x [23]. When noise is present 
in the measurements, it is natural to consider the corresponding optimization problem (P) 
with the objective function given by 



/li(x) = ||Ax-b 
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A variety of algorithms have been proposed in order to approximate the solution to this prob- 
lem [231 [21] • One popular approach is to replace the io norm with the convex ii norm, which 
results in a convex problem. A variety of different greedy methods have also been proposed, 
such as the matching pursuit (MP) and orthogonal MP (OMP) algorithms [18]. We will relate 
our methods to these approaches in Section [3.2.1[ Another method that was proposed recently 
and is related to our approach below is the iterative hard thresholding algorithm [4], also re- 
ferred to as the "M-sparse" method. In [1] the authors consider a majorization- minimization 
approach to solve (P) with f = fu, and show that the resulting method converges to a local 
minima of (P) as long as the spectral norm of A satisfies ||A|| < 1. This algorithm is essen- 
tially a gradient projection method with stepsize 1. In Section ISTT] we will revisit the iterative 
hard thresholding method and show how it can be applied to the general formulation (P), as 
well as discuss the quahty of the limit points of the sequence generated by the algorithm. 

Although linear measurements are the most popular in the literature, recently, attention 
has been given to quadratic measurements. Sparse recovery problems from quadratic mea- 
surements arise in a variety of different problems in optics, as we discuss in the next example. 

Example 1.2. Recovery of sparse vectors from quadratic measurements has been treated 
recently in the context of sub- wavelength optical imaging [Til 120] • In these problems the goal 
is to recover a sparse image from its far-field measurements, where due to the laws of physics 
the relationship between the (clean) measurement and the unknown image is quadratic. In 
the quadratic relationship is a result of using partially-incoherent light. The quadratic 



behavior of the measurements in [TT] is a result of coherent diffractive imaging in which 
the image is recovered from its intensity pattern. Under an appropriate experimental setup, 
this problem amounts to reconstruction of a sparse signal from the magnitude of its Fourier 
transform. 

Mathematically, both problems can be described as follows: Given m symmetric matrices 
Ai, . . . , Am e M"^", find a vector x satisfying: 

||x||o < s. 
This problem can be written in the form of problem (P) with 

m 
/qu(x) = Yl {^^^i^ - Ci) • 

In this case, the objective function is nonconvex and quartic. 

Quadratic measurements appear more generally in phase retrieval problems, in which a 
signal x is to be recovered from the magnitude of its measurements i/i = |d*x|, where each 
measurement is a linear transform of the input x G M". Note that dj are complex-valued, that 
is di G C". Denoting by bi be the corresponding noisy measurements, and assuming a sparse 
input, our goal is to minimize J^^ii^l ~ |d*xp)^ subject to the constraint that ||x||o < s 
for some s, where m is the number of measurements. The objective function has the same 
structure as /qu with Aj = 3fJ(dj)3fJ(dj)-^ + 3(dj)3(dj)^. In [20], an algorithm was developed 
to treat such problems based on a semidefinite relaxation, and low-rank matrix recovery. 
However, for large scale problems, the method is not efficient and difficult to implement. An 
alternative algorithm was designed in [11] based on a greedy search. This approach requires 
solving a nonconvex optimization program in each internal iteration. 

To conclude this example, we note that the problem of recovering a signal from the mag- 
nitude of its Fourier transform has been studied extensively in the literature. Many methods 
have been developed for phase recovery [T5| which often rely on prior information about the 
signal, such as positivity or support constraints. One of the most popular techniques is based 
on alternating projections, where the current signal estimate is transformed back and forth 
between the object and the Fourier domains. The prior information and observations are used 
in each domain in order to form the next estimate. Two of the main approaches of this type 
are Gerchberg-Saxton [13J and Fienup [T2|. In general, these methods are not guaranteed 
to converge, and often require careful parameter selection and sufficient signal constraints in 
order to provide a reasonable result. 

In this paper we present a uniform approach to treating problems of the form (P). Necessary 
optimality conditions for problems consisting of minimizing differentiable (possibly noncon- 
vex) objective functions over convex feasibility sets are well known [5]. These conditions are 
also very often the basis for efficient algorithms for solving the respective optimization prob- 
lems. However, classical results on nonconvex optimization do not cover the case of sparsity 
constraints, which are neither convex nor continuous. In Section [2] we derive 3 classes of nec- 
essary optimality conditions for problem (P): basic feasibility, L-stationarity, and coordinate- 
wise (CW) optimality. We then show that CW-optimality implies L-stationarity for suitable 



values of L, and they both imply the basic feasibility property. In Section [3] we present two 
classes of algorithms for solving (P). The first algorithm is a generalization of the iterative 
hard thresholding method, and is based on the notion of L-stationarity. Under appropriate 
conditions we show that the limit points of the method are L-stationary points. The second 
class of methods are based on the concept of CW-optimality. These are basically coordinate 
descent type algorithms which update the support at each iteration by one or two variables. 
Due to their resemblance with the celebrated simplex method for linear programming, we refer 
to these methods as "sparse-simplex" algorithms. As we show, these algorithms are as simple 
as the iterative hard thresholding method, while obtaining stronger optimality guarantees. In 
Section H] we prove the convergence results of the various algorithms, establishing that the 
limit points of each of the methods satisfy the respective necessary optimality conditions. 

2 Necessary Optimality Conditions 

2.1 Notation and Assumptions 

For a given vector x G M" and an index set R C {1, . . . , n}, we denote by x/j the subvector 
of X corresponding to the indices in R. For example, if x = (4, 5, 2, 1)^ and R = {1, 3}, then 
x^ = (4, 2)^. The support set of x is defined by 

/i(x) = {z:xi^0}, 

and its complement is 

/o(x) = {i : Xi = 0} . 

We denote by Cs the set of vectors x that are at most s-sparse: 

Cs = {x : ||x||o < s}. 

For a vector x G MJ^ and i G {1, 2, . . . , n}, the ith largest absolute value component in x is 
denoted by Mj(x), so that in particular 

Mi(x) > M2(x) > . . . > M„(x). 

Also, Mi(x) = maxj=i_...^„ \xi\ and M„(x) = minj=i^...^„ \xi\. 
Throughout the paper we make the following assumption. 

Assumption 1. The objective function f is lower bounded. That is, there exists 7 G M such 
that f{^) >-iforall^eW. 

2.2 Basic Feasibility 

Optimality conditions have an important theoretical role in the study of optimization prob- 
lems. From a practical point of view, they are the basis for most numerical solution methods. 
Therefore, as a first step in studying problem (P), we would like to consider its optimality 
conditions, and then use them to generate algorithms. However, since (P) is nonconvex, it 



does not seem to posses necessary and sufficient conditions for optimality. Therefore, below we 
derive several necessary conditions, and analyze the relationship between them. We will then 
show in Section [3] how these conditions lead to algorithms that are guaranteed to generate a 
point satisfying the respective conditions. 

For unconstrained differentiable problems, a necessary optimality condition is that the 
gradient is zero. It is therefore natural to expect that a similar necessary condition will 
be true over the support /i(x*) of an optimal point x*. Inspired by linear programming 
terminology, we will call a vector satisfying this property a basic feasible vector. 

Definition 2.1. A vector x* G C^ is called a basic feasible (BF) vector of (P) if: 

1. when ||x*||o < s, V/(x*) = 0; 

2. when ||x*||o = s, Vj/(x*) = for all i G /i(x*). 

We will also say that a vector satisfies the "basic feasibility property" if it is a BF vector. 
Theorem 12. II establishes the fact that any optimal solution of (P) is also a BF vector. 

Theorem 2.1. Let x* be an optimal solution of (P). Then x* is a BF vector. 

Proof. If ||x*||o < s, then for any i G {1, 2, . . . ,n} 

G argmin{f7(t) = /(x* + tej)}. 

Otherwise there would exist a to for which /(x* + toGj) < /(x*), which is a contradiction to 
the optimality of x*. Therefore, we have Vi/(x*) = g'{Q) = 0. If ||x*||o = s, then the same 
argument holds for any i G /i(x*). D 

We conclude that a necessary condition for optimality is basic feasibility. It turns out that 
this condition is quite weak, namely, there are many BF points that are not optimal points. 
In the following two subsections we will consider stricter necessary optimality conditions. 

Before concluding this section we consider in more detail the special case of /(x) = /li(x) = 
II Ax — bp. We now show that under a suitable condition on A, which we refer to as s- 
regularity, there are only a finite number of BF points. This implies that there are only a 
finite number of points suspected to be optimal solutions. 

Definition 2.2 (s-regularity). A matrix A G R™^" is called s-regular if for every index set 
I C {1, 2, . . . , n} with \I\ = s, the columns of A associated with the index set I are linearly 
independent. 

Remark 2.1. s-regularity can also be expressed in terms of the Kruskal rank of A: The 
Kruskal rank of a matrix A is equal to s if every s columns of A are linearly independent. 
Another way to express this property is via the spark - spark(A) is the minimum number of 
linearly dependent columns (see [lO])- Thus, A is s-regular if and only if spark(A) > s + 1. 

When s < m, the s-regularity property is rather mild in the sense that if the components of 
A are independently randomly generated from a continuous distribution, then the s-regularity 
property will be satisfied with probability one. 



It is interesting to note that in the compressed sensing hterature, it is typically assumed 
that A is 2s-regular. This condition is necessary in order to ensure uniqueness of the solution 
to b = Ax for any x satisfying ||x||o < s. Here we are only requiring s- regularity, which is a 
milder requirement. 

The next lemma shows that when the s-regularity property holds, the number of BF points 
is finite. 

Lemma 2.1. Let /(x) = /li(x) = || Ax — b|p, where A G R™-^"- is an s-regular matrix and 
b G M™. Then the number of BF points of problem (P) is finite. 

Proof: Any BF vector x satisfies 

||x||o < s and Vi/Li(x) = 0, i G /i(x). 

Denote the support set of x by 5 = /i(x). Then l^l < s and from the derivative condition, 

Af (A5X5 - b) = 0, 

where A^ is the submatrix of A comprised of the columns corresponding to the set 5*. Here 
we used the fact that Ax = A^x^ for any x with support S. By the s-regular ity assumption 
it follows that the matrix A^A5 is nonsingular. Thus, 

X5 = {AlAsY'Alh. 

To summarize, for each set of indices S satisfying l^l < s, there is at most one candidate for 
a BF vector with support S. Since the number of subsets of {1, 2, . . . , n} is finite, the result 
follows. D 

2.3 L-Stationarity 

As we will see in the examples below, the basic feasibility property is a rather weak necessary 
optimality condition. Therefore, stronger necessary conditions are needed in order to obtain 
higher quality solutions. In this section we consider the L-stationarity property which is an 
extension of the concept of stationarity for convex constrained problems. In the next section 
we discuss coordinate-wise optimality which leads to stronger optimality results. 

We begin by recalling some well known elementary concepts on optimality conditions for 
convex constrained differentiable problems (for more details see e.g., [^). Consider a problem 
of the form 

(C): min{^(x) : X G C}, (2.1) 

where C is a closed convex set and (7 is a continuously differentiable function, which is possibly 
nonconvex. A vector x* G C is called stationary if 

(V^(x*), X - X*) > for all ^eC. (2.2) 

If X* is an optimal solution of (P), then it is also stationary. Therefore, stationarity is a 
necessary condition for optimality. Many optimization methods devised for solving nonconvex 
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problems of the form (C) are only guaranteed to converge to stationary points (occasionally 
it is only shown that all limit points of the generated sequence are stationary). 

It is often useful to use the property that for any L > 0, a vector x* is a stationary point 

if and only if 

1, 



x*=Pc(x*--V(7(x*)J, (2.3) 

where for a closed subset D C M", the operator Pd(-) denotes the orthogonal projection onto 
D, that is, 

Poiy) = argmin ||x - yf. 

xS-D 

It is interesting to note that condition (12. 3 p - although expressed in terms of the parameter 
L - does not actually depend on L by its equivalence to 



It is natural to try and extend (12.21) or (12.31) to the nonconvex (feasible set) setting. Con- 
dition (12.21) with g = f and C = C^ is actually not a necessary optimality condition so we do 
not pursue it further. To extend (12. 3p to the sparsity constrained problem (P), we introduce 
the notion of "L-stationarity" . 

Definition 2.3. A vector x* G Cg is called an L-stationary point of (P) if it satisfies the 
relation 

[NC^] X* G Pc. (x* - iv/(x*)^ . (2.4) 

Note that since Cg is not a convex set, the orthogonal projection operator Pcs(') is not 
single- valued. Specifically, the orthogonal projection Pca(x) is a vector consisting of the s 
components of x with the largest absolute value. In general, there could be more than one 
choice to the s largest components. For example: 

Pc.((2,l,l)^) = {(2,l,0)^,(2,0,l)^}. 

Below we will show that under an appropriate Lipschitz condition, L-stationarity is a 
necessary condition for optimality. Before proving this result, we describe a more explicit 
representation of [NCl]- 

Lemma 2.2. For any L > 0, x* satisfies [NC^] if and only if ||x*||o < s and 

|Vf(x*)||^^^^^("*) ^/^^^o(x*), 

'^^^^''^'1=0 z/^G/i(x*). ^^-^^ 

Proof: ([NCl] ^ (D). Suppose that x* satisfies [NC^]. If i G /i(x*), then by [NC^] we 
havex* = x* - iV,/(x*), so that V,/(x*) = 0. Ifi G /o(x*), then \x* - iV,/(x*)| < M,(x*), 
which combined with the fact that x* = implies that |Vi/(x*)| < LMs{^*)., and consequently 
fl23|) holds true. 

ff l23|) ^ [NCl]). Suppose that x* satisfies (Q. If ||x*||o < s, then M,(x*) = and by fl23D 
it follows that V/(x*) = 0; therefore, in this case, Pc, (x* - -iV/(x*)) = PcX^*) is the set 
{x*}. If ||x*||o = s, then M,(x*) 7^ and |/i(x*)| = s. By ([53D 

K - l/LV./(x-)| { ^ Wl ^., ^^^^^^ 



Therefore, the vector x* — -i^V/(x*) contains the s components of x* with the largest absolute 
value and all other components are smaller or equal to them, so that [NC/,] holds. D 

A direct result of Lemma [2.21 is that any L-stationary point is a BF point. 

Corollary 2.1. Suppose that x* is an L-stationary point for some L > 0. Then x* is a BF 
point. 

Remark 2.2. By Lemma [2.21 it follows that the condition for L-stationarity depends on L. 
In particular, [NC/,] is stronger/more restrictive as L gets smaller. That is, if x* is an Li 
stationary point, then it is also an L2-stationary point for any L2 > Li. This is a different 
situation than the one described for problems with convex feasible sets where stationarity 
does not depend on any parameter. Based on this observation, it is natural to define the 
stationarity level of a BF vector x* G Cg as the smallest nonnegative L for which condition 
(12. 5p holds. If a BF vector x* satisfies ||x*||o < s, then the stationarity level is zero. If 
||x*||o = s, then the stationarity level, denoted by SLix.*), is given by 

5L X = max ----—, — ^. 

^ ^ iG/o(x*) M,(X*) 

The role of SL will become apparent when we discuss the proposed algorithms. 

In general, L-stationarity is not a necessary optimality condition for problem (P). To 
establish such a result, we need to assume a Lipschitz continuity property of V/. 

Assumption 2. The gradient of the objective function V/ is Lipschitz with constant L{f) 
over M".- 

II V/(x) - V/(y)|| < L(/)||x - y|| for every x, y G M". 

This assumption holds for / = /li with L{f) = 2Amax(A'^A), but not for / = /qu. 
Assumption [2] will not be made throughout the paper and it will be stated explicitly when 
needed. 

It is well known that a function satisfying Assumption [2] can be upper bounded by a 
quadratic function whose associated matrix is a multiple of the identity matrix. This result 
is known as the descent lemma: 

Lemma 2.3 (The Descent Lemma [3J). Let f be a continuously difjerentiable function satis- 
fying AssumptionlE Then for every L > L{f) 

/(x) < /iL(x,y) for any x,y G W, 

where 

/iL(x,y) = /(y) + (V/(y),x-y) + ^||x- yf , x,y G R". (2.6) 

Based on the descent lemma, we can prove the following technical and useful result. 

Lemma 2.4. Suppose that AssumptionlE holds and that L > L{f). Then for any ^ & Cs and 

y G M" satisfying 

yGPc. (x-^V/(x)V (2.7) 
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we have 

/(x)-/(y)>^— f^||x-yf. (2.8) 

Proof. Note that fl2.7p can be written as 



z-(x-iv/(x) 



y G argmm 

After rearrangement of terms, this minimization problem can be easily seen to be equivalent 

to 

y G argmin/i2,(z,x). 

zeCs 

This implies that 

/iL(y,x)</ii(x,x) = /(x). (2.9) 

Now, by the descent lemma we have 

/(x)-/(y)>/(x)-/i^(^)(y,x), 
which combined with (12.91) and the identity 

/iL(/)(x,y) = /iL(x,y) ||x-y|| , 

yields (jMD- □ 

Under Assumption [2] we now show that an optimal solution of (P) is an L-stationary point 
for any L > L{f). 

Theorem 2.2. Suppose that Assumptionl^ holds, L > L{f) and let x* be an optimal solution 
of (P). Then 

(i) X* is an L-stationary point, 
(zz) The set Pc, (x* - ;^V/(x*)) is a szngletoiQ. 

Proof: We will prove both parts simultaneously. Suppose to the contrary that there exists a 
vector 

y e Pcs (x* - ^V/(x*)^ , (2.10) 

which is different from x* (y 7^ x*). Invoking Lemma [2.41 with x = x*, we have 

/(x*)-/(y)>^^^^^||x*-yf, 

contradicting the optimality of x*. We conclude that x* is the only vector in the set 
Pc.(x*-^V/(x*)). D 

To conclude this section, we have shown that under a Lipschitz condition on V/, L- 
stationarity for any L > L{f) is a necessary optimality condition, which also implies the basic 
feasibility property. In Section 13.11 we will show how the iterative hard thresholding method 
for solving the general problem (P), can be used in order to find L-stationary points (for 
L>L{f)). 



^A set is called a singleton if it contains exactly one element. 



2.4 Coordinate-Wise Minima 

The L-stationarity necessary optimality condition has two major drawbacks: first, it requires 
the function's gradient to be Lipschitz continuous and second, in order to vahdate it, we need 
to know a bound on the Lipschitz constant. We now consider a different and stronger necessary 
optimality condition that does not require such knowledge on the Lipschitz constant, and in 
fact does not even require Assumption [2] to hold. 

For a general unconstrained optimization problem, a vector x* is called a "coordinate-wise 
(CW)" minimum if for every i = 1,2, ... ,n the scalar x* is a minimum of / with respect to 
the ith component Xi while keeping all other variables fixed: 

Xj t argmm ji^X]^, . . . ,Xi_i,Xi,Xi^i, . . . ,x„). 

Clearly, any optimal x* is also a coordinate-wise minimum. It is therefore natural to extend 
this definition to problem (P), in order to obtain an alternative necessary condition. 

Definition 2.4. Let x* be a feasible solution of (P). Then x* is called a coordinate-wise (CW) 
minimum of (P) if one of the following cases hold true: 
Case I: ||x*||o < s and for every i = 1,2, . . . ,n one has: 

/(x*)=min/(x*+te,). (2.11) 

Case II: ||x*||o = s and for every i G /i(x*) and j = 1,2, . . . ,n one has: 

/(x*) < min/(x* - x*e, + te,). (2.12) 

Obviously, any optimal solution of (P) is also a CW-minimum. This is formally stated in 
the next theorem. 

Theorem 2.3. Let x* be an optimal solution of (P). Then x* is a CW-minimum of (P). 

It is easy to see that any CW-minimum is also a BF vector, as stated in the following 
lemma. 

Lemma 2.5. Let x* E Cg be a CW-minimum of (P). Then x* is also a BF vector. 

Proof. We first show that if a vector x* satisfying ||x*||o = s is a CW-minimum of (P), then 
(12. lip is satisfied for any i e /i(x*). Indeed, let i G /i(x*) and take j = i. Then (12.121) 
becomes 

/(x*)<min/(x*-a;*e, + te,). (2.13) 

Since /(x* - x*ej + x*ei) = fix.*), it follows that (I2.13P is equivalent to 

/(x*) = min /(x* - x* e^ + tej), 

which letting s = t — x* becomes 

/(x*) =min/(x* + sei). 
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We conclude that for any CW-minimum x* of (P) we have 

Vi/(x*) = for all t G /i(x*). (2.14) 

In addition, in case I we obviously have that V/(x*) = 0, which completes the proof. D 

We have previously established under Assumption [2] in Theorem 12.21 that being an L- 
stationary point for L > L{f) is a necessary condition for optimality. A natural question that 
arises is what is the relation between CW-minima and L-stationary points (for L > L{f)). 
We will show that being a CW-minimum is a stronger, i.e. more restrictive, condition than 
being an L-stationary point for any L > L{f). In fact, a stronger result will be established: 
any CW-minimum is also an L stationary point for an L which is less than or equal to L{f). 
In practice, L can be much smaller than L{f). 

In order to precisely define L, we note that under Assumption [21 it follows immediately 
that for any i ^ j there exists a constant Lij{f) for which 

||V,,-/(x)-V,,/(x + d)|| <L,,,(/)||d||, (2.15) 

for any x G R" and any d G M" which has at most two nonzero components. Here Vjj/(x) 
denotes a vector of length-2 whose elements are the ith and jth elements of V/(x). We will 
be especially interested in the following constant, which we call the local Lipschitz constant: 

L2U) ='^^^Lij{f). 
Clearly (12.151) is satisfied when replacing Lij{f) by L[f). Therefore, in general, 

L2{f) < L{f). 
In practice, L2{f) can be much smaller than L{f) as the following example illustrates. 

Example 2.1. Suppose that the objective function in (P) is /(x) = x-'^Qx + 2b^x, with b 
being a vector in MJ^ and 

where I„ is the n x n identity matrix and J„ is the n x n matrix of all ones. Then 

L{f) = 2A^ax(Q) = 2A^ax(I„ + Jn) = 2(n + 1). 

On the other hand, for any i ^ j the constant Lij{f) is twice the maximum eigenvalue of the 
submatrix of Q consisting of the ith and jth rows and columns. That is, 

/2 1\ 

^hjif) — 2Amax I , 9 ) ~ ^• 

For large n, L{f) = 2n + 2 can be much larger than L2{f) = 6. 

It is not difficult to see that the descent lemma (Lemma 12.31) can be refined to a suitable 
"local" version. 
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Lemma 2.6 (Local Descent Lemma). Suppose that Assumption\^ holds. Then 

/(x + d) < /(x) + V/(x)^d + ^||df 

for any vector d G M" with at most two nonzero components. 

Using the local descent lemma we can now show that a CW- minimum is also an L2(/)- 
stationary point. 

Theorem 2.4. Suppose that AssumptionlE holds and let x* be a CW-minimum of (P). Then 

|V./(x-)| I ^ i^(/)A'.(-*) ' ^ '«(-•). (2.16) 

' '•'^ ^' \ =0 zG/i(x*), ^ ^ 

i/iai is, X* zs an L2{f) -stationary point. 

Proof: Since x* is a CW-minimum, it follows by Lemma 12.51 that it is a BF vector. Thus, 
since ||x*||o < s, we have V/(x*) = 0, establishing the result for this case. 

Suppose now that ||x*||o = s. Let i G /i(x*). Then, again, by Lemma [2751 it follows that 
X* is a BF vector and thus Vj/(x*) = 0. Now let i G /o(x*) and let m be an index for which 
|x^| = Ms(x*). Obviously, m G /i(x*), and thus, since x* is a CW-minimum, it follows in 
particular that 

/(x*) < /(x* - x*^e^ - axle,), (2.17) 

where a = sgn (xj!^Vj/(x*)). By the local descent lemma (Lemma 12.61) we have 



/(x* - x*^em - axi.ei) < /(x*) + V/(x*)^(-a;;,e„ - (rx*^ei) + ^^||a;>^ + ax*^e,f 

= /(x*) - xlV.^f{^*) - crx^VJi^*) + L2{f){x*J' 

= /(x*) - crx^VJi^n + L,if){x*j', (2.18) 

where the last equality follows from the fact that since m G /i(x*), it follows by (12.141) that 
V^/(x*) = 0. 

Combining (12.171) and (I2.18P we obtain that 

0<-ax*^ydi^') + L2if)ix*J'. 

Recalling the definition of a, we conclude that 

\xlVJ{^*)\<L2{f){xir, 

which is equivalent to 

|V,/(x*)|<L2(/)K|=L2(/)M,(x*), 

concluding the proof. D 

An immediate consequence of Theorem 12.41 is that under Assumption [21 any optimal solu- 
tion of (P) is an L2(/)— stationary point. 

12 



Corollary 2.2. Suppose that Assumption IB holds. Then any optimal solution of (P) is also 
an L2{f)— stationary point of (P). 

To summarize our discussion on optimality conditions we have shown that without As- 
sumption [2] we have the following relations: 



Theorem 12.31 
Lemma 12.51 



Under Assumption [2l we have: 



Theorem 12.31 
Theorem 12.41 
Corrolary 12.11 



optimal solution of (P) 

CW-minimum of (P) 

BF vector of (P) 



optimal solution of (P) 
CW-minimum of (P) 

-^2(/) — stationary 
BF vector of (P) 



To illustrate these relationships we consider a detailed example. 
Example 2.2. Consider problem (P) with s = 2,n = 5 and 

/(x) =x^Qx + 2b^x, 



where Q = I5 -|- J5 as in Example 12. ![ and b = —(3, 2, 3, 12, 5)^. In Lemma 12.11 we showed 
how to compute the BF vectors of problem (P) with a quadratic objective. Using this method 
it is easy to see that in our case there are 10 BF vectors given by (each corresponding to a 
different choice of two variables out of 5): 

xi = (1.3333,0.3333,0,0,0)^, 

X2 = (1.0000,0,1.0000,0,0)^, 

X3 = (-2.0000,0,0,7.0000,0)^, 

X4 = (0.3333,0,0,0,2.3333)^, 

X5 = (0,0.3333,1.3333,0,0)^, 

X6 = (0,-2.6667,0,7.3333,0)^, 

X7 = (0,-0.3333,0,0,2.6667)^, 

xg = (0,0, -2.0000, 7.0000, 0)^ 

xg = (0,0,0.3333,0,2.3333)'^, 

xio = (0,0,0,6.3333,-0.6667)^. 

The stationarity levels and function values up to two digits of accuracy of each of the BF 
vectors is given in Table [T] 
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BF vector number 


1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


function value 
stationarity level 


-4.66 
62 


-6.00 
20 


-78 
3 


-12.66 
56 


-4.66 
62 


-82.66 
1.25 


-12.66 

58 


-78 
3 


-12.66 
56 


-72.66 
11 



Table 1: Function values and stationarity levels of the 10 BF vectors. 



Since in this case -^2(/) = 6 (see Example 12.11) . it follows by Corollary 12.21 that any 
optimal solution is a 6-stationary point, implying that only the three BF vectors X3,xe,X8 
are candidates for being optimal solutions. In addition, by Theorem 12.41 only these three 
BF vectors may be CW-minima. By direct calculation we found that only xg - the optimal 
solution of the problem - is a CW-minima. Therefore, in this case, the only CW-minima is 
the global optimal solution. Note, however, that there could of course be examples in which 
there exist CW-minima which are not optimal. 

3 Numerical Algorithms 

We now develop two classes of algorithms that achieve the necessary conditions defined in the 
previous section: 

• Iterative hard thresholding (IHT). The first algorithm, results from using the L- 
stationary condition. For the case f = fi^i, and under the assumption that ||A||2 < 1, it 
coincides with the IHT method [1] . Our approach extends this algorithm to the general 
case under Assumption [21 and it will be refereed to as "the IHT method" in our general 
setting as well. We will prove that the limit points of the algorithm are L(/)— stationary 
points. As we show, this method is well-defined only when Assumption [2] holds and relies 
on knowledge of the Lipschitz constant. 

• Sparse-simplex methods. The other two algorithms we suggest are essentially co- 
ordinate descent methods that optimize the objective function at each iteration with 
respect to either one or two decision variables. 

The first algorithm in this class seeks the coordinate, or coordinates, that lead to the 
largest decrease and optimizes with respect to them. Since the support of the iter- 
ates changes by at most one index, it has some resemblance to the celebrated simplex 
method for linear programming and will thus be referred to as "the greedy sparse-simplex 
method" . We show that any limit point of the sequence generated by this approach is a 
CW-minima, which as shown in Theorem 12.41 is a stronger notion than L— stationarity 
for any L > -Z^2(/)- An additional advantage of this approach is that it is well defined 
even when Assumption [2] is not valid, and does not require any knowledge of the Lip- 
schitz constant even when one exists. The disadvantage of the greedy sparse-simplex 
method is that it does not have a selection strategy for choosing the indices of the 
variables to be optimized, but rather explores all possible choices. Depending on the 
objective, this may be a very costly step. 
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To overcome this drawback, we suggest a second coordinate descent algorithm with an 
extremely simple index selection rule; this rule discards the need to perform an exhaus- 
tive search for the relevant indices on which the optimization will be performed. This 
approach will be refereed to as "the partial sparse-simplex method". Under Assumption 
[2] we show that it is guaranteed to converge to L2(/)-stationary points. 

In the ensuing subsections we consider each of the algorithms above. We present the 
methods along with statements regarding their convergence properties. The detailed proofs 
of the convergence results are deferred to Section |H 

3.1 The IHT Method 

One approach for solving problem (P) is to employ the following fixed point method in order 
to "enforce" the L-stationary condition (12.41) : 

x^+i G Pc. (x'^ - ^V/(x'=)y k = 0,1,2,... (3.1) 

Convergence results on this method can be obtained when Assumption |2] holds; we will 
therefore make this assumption throughout this subsection. The iterations defined by (13. ip 
were studied in |1] for the special case in which / = /li and ||A||2 < 1, and were re- 
ferred to as the "M-sparse" algorithm. Later on, in [5], the authors referred to this ap- 
proach as the IHT method (again, for / = /li) and analyzed a version with an adap- 
tive step size which avoids the need for the normalization property ||A||2 < 1. Simi- 
arly, we refer to this approach for more general obejctvie functions as the IHT method: 



The IHT method 

Input: a constant L > L{f). 

• Initialization: Choose xq G C^. 

• General step : x^'+i G Pp. (x'' - ;^V/(x'=)) , {k = 0, 1, 2, 



It can be shown that the general step of the IHT method is equivalent to the relation 

x^+^ G argmin/iL(x,x^), (3.2) 

xgCs 

where /^^(x, y) is defined by (12.61) (see also the proof of Theorem 12.21) . 

Several basic properties of the IHT method are summarized in the following lemma. 

Lemma 3.1. Let {x^}jt>o be the sequence generated by the IHT method with a constant stepsize 
J where L > L{f). Then 

1. /(x^) - /(x'^+i) > ^^^llx'^ - x'^+if^ 

2. {/(x'^)}fc>o is a nonincreasing sequence. 

3. ||x'=-x^+i|| ^0. 

4- For every k = 0,1,2,. . ., ij ^ i^ x''^^ then /(x^+^) < /(x''). 
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Proof: Part 1 follows by substituting x = x.'','y = x.^^^ in (12. 8p . Parts 2,3 and 4 follow 

immediately from part 1. D 

A direct consequence of Lemma 13.11 is the convergence of the sequence of function values. 

Corollary 3.1. Let {x'^}fc>o be the sequence generated by the IHT method with a constant 
stepsize -^ where L > L{f). Then the sequence {/(x^)}fc>o converges. 

As we have seen, the IHT algorithm can be viewed as a fixed point method for solving 
the condition for L-stationarity. The following theorem states that all accumulation points of 
the sequence generated by the IHT method with constant stepsize ^ are indeed L-stationary 
points. 

Theorem 3.1. Let {x'^}fc>o be the sequence generated by the IHT method with stepsize -^ 
where L > L{f). Then any accumulation point of {x'^}k>o is an L-stationary point. 

Proof. See Section SXH 

3.1.1 The Case / = fu 

When /(x) = /li(x) = || Ax — bp, and under the assumption of s-regularity of (P), we know 
by Lemma 12.11 that the number of BF vectors is finite. Utilizing this fact we can now show 
convergence of the whole sequence generated by the IHT method when / = /li- This result 
is stronger than the one of Theorem 13. ![ which only shows that all accumulation points are 
L-stationary points. 

Theorem 3.2. Let /(x) = /li(x) = || Ax — bp. Suppose that the s-regularity property holds 
for the matrix A. Then the sequence generated by the IHT method with stepsize -^ where 
L > L{f) converges to an L-stationary point. 

Proof. See Section HX^ 

Remark 3.1. As we have noted previously, the IHT method in the case / = /li with fixed 
step-size set to 1 was proposed in |3]. It was shown in [^ that if A satisfies the s-regularity 
property and ||A||2 < 1, then the algorithm converges to a local minimum. This result is 
consistent with Theorem 13.21 since when ||A||2 < 1, the Lipschitz constant satisfies L{f) < 1, 
and we can therefore assure convergence by Theorem 13.21 with stepsize equal to 1. In [5] 
the authors note that the IHT method with stepsize 1 might diverge when ||A||2 > 1. To 
overcome this limitation, they propose an adaptive stepsize for which they show the same 
type of convergence results. Our result here shows that a fixed step size which depends on 
the Lipschitz constant can also be used. 

3.1.2 Examples 

Example 3.1. Consider the problem 

min{/(xi,X2) = 12a;? + 20xia;2 + 32a;2 : ||(xi; X2)^||o < l} . (3.3) 
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L=250 



L=500 





Figure 1: The optimal solution (0, —9/16)"^ is denoted by a red asterisk and the additional 
BF vector (—1/12, 0)"^ is denoted by a red diamond. The region of convergence to the optimal 
solution is the blue region and the points in the white region converged to the non-optimal 
point (—1/12,0)^. The left image describes the convergence region when the IHT method 
was invoked with L = 250, while the right image describes the same for L = 500. When L 
gets larger, the chances to converge to the non-optimal L-stationary point are higher. 

The objective function is convex quadratic and the Lipschitz constant of its gradient is given 

by 

'12 10' 



Hf) = 2\ 



10 16 



48.3961. 



It can be easily seen that there are only two BF vectors to this problem: 
(0, —9/16)"^, (—1/12, 0)"^ (constructed by taking one variable to be zero and the other to 
satisfy that the corresponding partial derivative is zero). The optimal solution of the problem 
is the first BF vector (0, —9/16)"^ with objective function value of —81/16. This point is an 
L-stationary point for any L > L{f). The second point (—1/12, 0)^ is not an optimal solution 
(its objective function value is —1/12). Since V2/((— 1/12, 0)"^) = 49/3, it follows by Lemma 
12.21 that it is an L-stationary point for L > Y/12 = 196. Therefore, for any L G [L(/), 196), 
only the optimal solution (0, —9/16)"^ is an L-stationary point and the IHT method is guar- 
anteed to converge to the global optimal solution. However, if the upper bound is chosen to 
satisfy L > 196, then (—1/12, 0)^ is also an L-stationary point and the IHT method might 
converge to it. This is illustrated in Figure 13.1.21 

Example 3.2. For any two positive number a < b, consider the problem 

min{/(xi,X2) = a{xi - 1)^ + 6(^2 - 1)^ : ||(xi,X2)^||o < !}• 

Obviously the optimal solution of the problem is (xi, X2) = (0, 1). An additional BF vector is 
i = (1,0)^. Note that here L(/) = 26. Therefore, since V/(S:) = (0, -26)^ and Mi(i) = 1, 
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it follows that 

|V2/(x)|<L(/)Mi(x), 

and hence x will also be an L-stationary point for any L > L{f). Therefore, in this problem, 
regardless of the value of L, there is always a chance to converge to a non-optimal solution. 

3.2 The Greedy Sparse-Simplex Method 

The IHT method is able to find L-stationary points for any L > L{f) under Assumption 
O However, by Corollary 12.21 any optimal solution is also an L2(/)-stationary point, and 
^2(1) can be significantly smaller than L{f). It is therefore natural to seek a method that is 
able to generate such points. An even better approach would be to derive an algorithm that 
converges to a CW-minima, which by Theorem 12. 4[ is a stronger notion than L-stationary. 
An additional drawback of the IHT method is that it requires the validity of Assumption [2] 
and the knowledge of the Lipschitz constant L{f). 

Below we present the greedy sparse- simplex method which overcomes the faults of the IHT 
method alluded to above: its limit points are CW-minima, it does not require the validity of 
Assumption [2], but if the assumption does hold, than its limit points are L2(/)— stationary 
points (without the need to know any information on Lipschitz constants). 



The Greedy Sparse-Simplex Method 




• Initiahzation: Choose xq G C^. 




• General step : (A; = 0, 1, . . .) 




• If x^ < s, then compute for every i = 1, 2, . . . , n 




ti e argmin/(x'' + tei), 


(3.4) 


fi = min/(x^' + tei). 




Let ik G argmin/j. If /j^ < /(x'^), then set 

i=l,...,n 




X -p Zi^Gi^. 




Otherwise, STOP. 




• If 1 x'^l = s, then for every i G /i(x'^) and j = 1, . . . ,n compute 




tij G argmin/(x — x^ei + tej), 


(3.5) 


fij = minfi^'' -x'lei + tej). 




Let {ikjk) G argmin{/ij : i G /i(x''), j = 1, . . . , n}. If /j^j^ < /(x''), 


then set 


^ = X — Xj^ej^ + tif,j^,ej^. 




Otherwise, STOP. 





Remark 3.2. One advantage of the greedy sparse-simplex method is that it can be easily 
implemented for the case / = /qu, that is, the case when the objective function is quartic. 
In this case the minimization steps (13. 4 p and ( 13. Sp consist of finding the minimum of a scalar 
quartic (though nonconvex) function, which is an easy task since the minimizer is one of the 
at most three roots of the cubic polynomial derivative. 

By its definition, the greedy sparse-simplex method generates a non-increasing sequence 
of function values and gets stuck only at CW-minima. 

Lemma 3.2. Let {x^} he the sequence generated by the greedy sparse- simplex method. Then 
j'(x.k+i^ < f{x.^) for every k > and equality holds if and only if x.^ = x.^'^^ and x'^ is a 
CW-minimum. 

Theorem 13 . 3 1 est ablishes the main convergence result for the greedy simplex-sparse method, 
namely that its accumulation points are CW-minima. 

Theorem 3.3. Let {x*^} be the sequence generated by the greedy sparse- simplex method. Then 
any accumulation point of {x.^} is a CW-minimum of (P). 
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Proof. See Section K2A[ 

Combining Theorem 13.31 with Theorem 12.41 leads to the following corollary. 

Corollary 3.2. Suppose that AssumptionlE holds and let {x'^} be the sequence generated by the 
greedy sparse- simplex method. Then any accumulation point of {'x.^} is an L2{f) -stationary 
point of (P). 



3.2.1 The Case / = /] 



LI 



We consider now the greedy sparse-simplex method when / = /lj. At step (13. 4 p we perform 
the minimization tj = argmin/(x'^ + tej). Since /(x'^ + tej) = HAx*^ — b + taj|p (aj being the 
ith column of A), we have immediately that 



afrfc 



a, 



12' 



where r^ = Ax'^ — b. We can then continue to compute 

T 



f^ 

so that 



ai M 



r.f-¥^' (3-6) 



Ia,;'l2 



Ik ^ argmm jj = argmax ■ 



i=l,...,n. i=l,...,n. 11^-4 1| 

The algorithm then proceeds as follows. For ||x^||o < s we choose 



If a^Ffc 7^ 0, then we set 



In this case, 



I T I 

Ik € argmax— I — —. (3.7) 

i=l,...,n W'^iW 



T 
k+l _^k _ ^ijk 



Ffc+i = Ax^'+^ - b = r;t 



af.rfc 



II 110 ^ h ' 

II H- II 

Otherwise we stop. Note, that if A has full row-rank, then a^r^ = only if r^ = 0. 



For llx'^lln = s we choose 



- " "■" "^"""" 

I T ?■ I 
f M \^j^k\ 

{ik,Jk) = argmax n n , 



je/i(x'=)je{i,2,...,n} \\^j\\ 
with r*fc = Ax'^ - xfa^ - b. Let /j^j, = /(x^ - xf^e^^ + te^J with 



If /i;c,j* < /(x''), then we set 



II Jfc II 



fc+1 _ k _ k _ ^hH 

X —X -t-ifct^lfc II 112'^Jfe- 
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Otherwise we stop. 

It is interesting to compare the resulting iterations with the matching-pursuit (MP) algo- 
rithm [TS] designed to find a sparse solution to the system Ax = b. The MP method begins 
with an initial guess of x" = and fq = b. At each iteration, we add an element to the 
support by choosing 

TTiGargmax— ^ — —. \A-°) 

j=l,2,...,n ll^ill 

The current estimate of x is then updated as 

x^+^ = x^ - TT^e^, (3.9) 



and the residual is updated as 



m 



^fc+i ^ A^.+i _ b = r, - T^a„. (3.10) 



The iterations continue until there are s elements in the support. Evidently, the MP method 
coincides with our method as long as the support is smaller than s. Our approach however 
has several advantages: 

• We do not need to initialize it with a zero vector; 

• In MP once an index m is added to the support it will not be removed unless in some it- 
eration a^Ffc = Xm||am||^ and m maximizes afrfc/||aj||. On the other hand, our approach 
allows to remove elements from the support under much broader conditions. Thus, there 
is an inherent "correction" scheme incorporated into our algorithm; 

• In MP the algorithm stops once the maximal support is achieved. In contrast, in our 
approach, further iterations are made by utilizing the correction mechanism. 

We note that once our method converges to a fixed support set, it continues to update the 
values on the support. Ultimately, it converges to the least-squares solution on the support 
since in this situation the method is a simple coordinate descent method employed on a 
convex function. This is similar in spirit to the orthogonal MP (OMP) approach [TTj. The 
OMP proceeds similarly to the MP method, however, at each stage it updates the vector x'^ 
as the least-squares solution on the current support. In our approach, we will converge to 
the least-squares solution on the final support, however, in choosing the support values we 
do not perform this orthogonalization. Instead, we allow for a correction stage which aids in 
correcting erroneous decisions. 

3.2.2 Examples 

Example 3.3. Consider the sparse least squares problem 

(Pa) min{||Ax-bf ixeCa}, 
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where A G R^^^ and b G M^ are given by (up to 4 digits of accuracy' 


\ 




/0.8899 -0.4355 0.5304 -0.2324 0.3745 \ ( 1.3254 ^ 




0.0797 -0.3475 0.0942 0.9681 -0.4919 


,b = 


0.4272 




0.4425 0.3248 0.6921 0.0921 0.7575 


0.1177 




1^0.0773 0.7643 -0.4804 0.0142 0.2099 ) 




1^-0.6870/ 



The matrix A was constructed as follows: first, the components were randomly and inde- 
pendently generated from a standard normal distribution, and then all the columns were 
normalized. The vector b was chosen as b = Axtruc, where Xtrue = (1, ~1) 0, 0, 0)"^, so that 
Xtiuc is the optimal solution of the problem. The problem has 10 BF vectors (corresponding 
to the 5-choose-2 options for the support of the solution) and they are denoted by 1, 2, . . . , 10, 
where the first solution is the optimal solution Xtruc- The corresponding objective function 
values and stationarity levels (with two digits of accuracy) are given in Table [2l 



BF vector number 


1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


function value 
stationarity level 


-2.42 
0.00 


-1.60 
2.90 


-1.51 

8.46 


-1.99 
0.91 


-1.99 

1.08 


-1.48 
13.97 


-2.11 
0.69 


-1.33 

18.70 


-1.61 
1.50 


-0.11 
9.05 



Table 2: Function values and stationarity levels of the 10 BF vectors of (-P2)- 

In this problem L{f) = 4.78 and i^2(/) = 3.4972. We compared three methods: 

• the IHT method with Li = l.lL{f). 

• the IHT method with La = 2L(/). 

• the greedy sparse-simplex method. 

Each of these methods was run 1000 times with different randomly generated starting 
points. All the runs converged to one of the 10 BF vectors. The number of times each method 
converged to each of the BF vectors is given in Table O 



BF vector (z) 


1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


N,{^) 


329 


50 


63 


92 


229 





130 





61 


46 


N2i^) 


340 


59 





89 


256 





187 





69 





Nsit) 


813 








112 








75 











N,it) 


772 








92 








93 





43 






Table 3: Distribution of limit points among the 10 BF vector. Ni{i) [N2{i)) is the amount 
of runs for which the IHT method with Li (L2) converged to the ith BF vector. N^li) is the 
amount of runs for which the greedy sparse-simplex method converged to the ith BF vector. 
The exact definition of N^li) will be made clear in Section [331 

First note that when employing the IHT method with L2 = 2L(/) = 9.56, the method 
never converged to the BF vectors 6,8. The theoretical reason for this phenomena is simple: 
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the stationarity levels of these two points are 13.97 and 18.70, and they are therefore not 
9.56-stationary points. When Li = 1.1 ■ L{f) = 5.26, there are two additional BF vectors 
- 3 and 10 - to which convergence is impossible, because their stationarity level is 8.46 and 
9.05. This illustrates the fact that as L gets larger, there are more non-optimal candidates 
to which the IHT method can converge. The greedy sparse-simplex method exhibits the best 
results with more than 80% chance to converge to the true optimal solution. Note that this 
method will never converge to the BF vectors 3, 6, 8 and 10 since they are not L2(/)-stationary 
points. Moreover, there are only three possible BF vectors to which the greedy sparse-simplex 
method converged: 1,4 and 7. The reason is that among the 10 BF vectors, there are only 
three CW- minima. This illustrates the fact that even though any CW-minimum is an L2(/)- 
stationary point, the reverse claim is not true - there are L2(/)-stationary points which are 
not CW-minima. 

In Table m we describe the 11 first iterations of the greedy sparse-simplex method. Note 
that at the 4th iteration the algorithm "finds" the correct support and the rest of the iterations 
are devoted to computing the exact values of the nonnegative components of the BF vector. 



iteration number 


Xi 


X2 


X3 


X4 


X5 








1 


5 








1 





1.0000 


1.5608 








2 








1.5608 





-0.6674 


3 


1.6431 











-0.6674 


4 


1.6431 


-0.8634 











5 


1.0290 


-0.8634 











6 


1.0290 


-0.9938 











7 


1.0013 


-0.9938 











8 


1.0013 


-0.9997 











9 


1.0001 


-0.9997 











10 


1.0001 


-1.0000 











11 


1.0000 


-1.0000 












Table 4: First 11 iterations of the greedy sparse-simplex method with starting point 
(0,1, 5, 0,0, Of. 



Example 3.4 (Comparison with MP and OMP). To compare the performance of MP and 
OMP to that of the greedy sparse-simplex, we generated 1000 realizations of A and b exactly 
as described in Example 13.31 We ran both MP and OMP on these problems with s = 2. Each 
of these methods were considered "successful" if it found the correct support (MP usually 
does not find the correct values). The greedy sparse-simplex method was run with an initial 
vector of all zero, so that the first two iterations were identical to MP. The results were the 
following: out of the 1000 realizations both MP and OMP found the correct support in 452 
cases. The greedy sparse-simplex method, which adds "correcting" steps to MP was able to 
recover the correct support in 652 instances. 
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An additional advantage of the greedy sparse-simplex method is that it is capable of 
running from various starting points. We therefore added the following experiment: for each 
realization of A and b, we ran the greedy sparse-simplex method from 5 different initial vectors 
generated in the same way as in Example 13.31 (and not the all zeros vector). If at least one of 
these 5 runs detected the correct support, then the experiment is considered to be a success. 
In this case the correct support was found 952 times out of the 1000 realizations. 

The example above illustrates an important feature of the greedy sparse-simplex algorithm: 
since it can be initialized with varying starting points, it is possible to improve its performance 
by using several starting points and obtaining several possible sparse solutions. The final 
solution can then be taken as the one with minimal objective function value. This feature 
provides additional flexibility over the MP and OMP methods. 

3.3 The Partial Sparse- Simplex Method 



The greedy sparse-simplex method, as illustrated in Example 13. 3 [ has several advantages over 
the IHT method: first, its limit points satisfy stronger optimality conditions, and as a result 
is more likely to converge to the optimal solution and second, it does not require knowledge of 
a Lipschitz constant. On the other hand, the computational effort per iteration of the greedy 
sparse-simplex algorithm is larger than the one required by the IHT approach. Indeed, in 
the worst case it requires the call for 0{s ■ n) one- dimensional minimization procedures; this 
computational burden is caused by the fact that the method has no index selection strategy. 
That is, instead of deciding a priori according to some policy on the index or indices on which 
the optimization will be performed, the algorithm invokes an optimization procedure for all 
possible choices, and then picks the index resulting with the minimal objective function value. 
The partial sparse-simplex method described below has an extremely simple way to choose 
the index or indices on which the optimization will be performed. The only difference from 
the greedy sparse-simplex algorithm is in the case when Hx'^Hq = s, where there are two 
options. Either perform a minimization with respect to the variable in the support of x*^ 
which causes the maximum decrease in function value; or replace the variable in the support 
with the smallest absolute value (that is, substituting zero instead of the current value), with 
the non-support variable corresponding to the largest absolute value of the partial derivative 
- the value of the new non-zero variable is set by performing a minimization procedure with 
respect to it. Finally, the best of the two choices (in terms of objective function value) is 
selected. Since the method is no longer "greedy" and only considers part of the choices for 
the pair of indices, we will call it the partial sparse- simplex method. 
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The Partial 


Sparse-Simplex Method 






• Initialization: ^x'^ e Cs- 






• General Step (k = 0, 1, 2, . . .): 






• If x'^ 


< s, then compute for every i = 1, 2, . . . , n 
ti e argmin/(x'' + tej), 
fi = mm f{x.'' + tei). 






Let Zfc G 


argmin/j. If /j^ < /(x*^), then set 

i=l,...,n 






Otherwise, STOP. 






• If x'^ 


= s, then compute for every i G /i(x*) 

U G argmin/(x'' + tei), 
fi = min ffx'' + tej). 






Let 


i\ G argmax{/i : i G I\{y^)}, 








il G argmax{|Vi/(x'=)| :zG/o 


(x'^)} 






ruk G argmin{ xf | : z G /ilx'')}, 






and let 








Dl = 


minteiK /(x'^' + tea ) , T^^ G argmin 


/(x'^ 


+ te..) 


Dl = 


minteK /(x'^ - x^ e^^ + teii), T| G argmin 

4GM 


/(x'^ 


- ^t.f^m, + teq) 


liDl< 


D^, then set 

x^+i=x'= + r,ie,i. 






Else 


x^-+i = x^--x^^e^,+T,%. 







Remark 3.3. The partial sparse-simplex method coincides with the greedy sparse-simplex 
method when Hx'^Ho < s. Therefore, when / = /li, the partial sparse-simplex method coin- 
cides with MP for the first s steps and when the initial vector is the vector of all zeros. 

The basic property of the partial sparse-simplex method is that it generates a nonincreasing 
sequence of function values and that all its limit points are BF vectors. 
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Lemma 3.3. Let {x'^} be the sequence generated by the partial sparse- simplex method. Then 
any accumulation point of {x'^} is a BF vector. 

Proof. See Section 11221 

The limit points of the partial sparse-simplex method are not necessarily CW-minima. 
However, when Assumption [2] holds, they are L2(/)-stationary points, which is a better result 
than the one known for the IHT method. 

Theorem 3.4. Suppose that AssumptionlE holds and let {'x.^} be the sequence generated by the 
partial sparse- simplex method. Then any accumulation point of {x.''} is an L2{f) -stationary 
point. 

Proof. See Section gXl 

We end this section by returning to Example 13.31 and adding a comparison to the partial 
sparse-simplex algorithm. 

Example 13.31 Contd. In Example 13.31 we added 1000 runs of the partial sparse-simplex 
method. The results can be found in Table [3] under Ni{i), which is the amount of times in 
which the algorithm converged to the ith BF vector. As can be seen, the method performs 
very well, much better than the IHT method with either Li = l.lL{f) or L2 = 2L{f). It 
is only slightly inferior to the greedy sparse-simplex method since it has another BS vector 
to which it might converge (BF vector number 9). Thus, in this example the partial sparse- 
simplex method is able to compare with the greedy sparse-simplex method despite the fact 
that each iteration is much cheaper in terms of computational effort. 

Example 3.5 (Quadratic Equations). We now consider an example of quadratic equations. 
Given m vectors ai, . . . , a^, our problem is to find a vector x G M" satisfying: 

(afx)2 = Ci, i = l,2, ...,m, 

ll^llo < s. 

The problem can be formulated as problem (P) with / = /qi where Aj = a^a^. We compare 
the greedy and partial sparse-simplex methods on an example with m = 80, n = 120 and 
s = 3, 4, . . . , 10. As noted previously in Remark 13. 2[ the greedy as well as the partial sparse- 
simplex methods require to solve several one-dimensional minimization problems of quartic 
equations at each iteration. Each component of the 80 vectors ai,...,a80 was randomly 
and independently generated from a standard normal distribution. Then, the "true" vector 
Xtrue was generated by choosing randomly the s nonzero components whose values were also 
randomly generated from a standard normal distribution. The vector c was then determined 
by Cj = (afxtrue)^- For each value oi s {s = 3,4, ...,10), we ran both the greedy and 
partial sparse-simplex methods from 100 different and randomly generated initial vectors. 
The numbers of runs out of 100 in which the methods found the correct solution is given in 
Table El 

As can be clearly seen by the results in the table, the greedy sparse-simplex method 
outperforms the partial sparse-simplex method in terms of the success probability. In addition, 
the chances of obtaining the optimal solution decreases as s gets larger. Of course, we can 
easily increase the success probability of the partial sparse-simplex method by starting it from 
several initial vectors and taking the best result. 
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s 


^PSS 


^GSS 


3 


27 


73 


4 


22 


69 


5 


8 


20 


6 


5 


19 


7 


9 


13 


8 


5 


8 


9 


3 


6 


10 


2 


3 



Table 5: The second (third) column contains the number of runs out of 100 for which the 
partial (greedy) sparse-simplex method converged. 



4 Proofs of Convergence Theorems 

In this section we collect the main convergence theorems of the algorithms proposed in the 
previous section. 

4.1 The IHT Method 

4.1.1 Proof of Theorem EH 

Suppose that x* is an accumulation point of the sequence. Then there exists a subsequence 
{x^"}„>o that converges to x*. By Lemma [3.11 



/(x^") - /(x^"+i) > ^llx'^" - ^^-+Y- 



(4.1) 



Since {/(x'^")}„>o and {/(x'^"+^)}„>o both converge to the same limit /*, it follows that 
/(x^") — /(x^"+^) — !• as n — i- oo, which combined with 04. ip yields that 



Recall that for all ri > 



-j^fcn+i _^ X* as ra -^ oo. 



^fc„+l ^ p^^ fk„ _ 1 yJ(xfe„)^ ^ 



Let i G /i(x*). By the convergence of x*^" and x'^"+^ to x*, it follows that there exists A^ such 
that 

xf", a;f"+^ 7^ for all n> N, 

and therefore, for n > N, 



X 



"^ri~rJ- n^'^n 



Taking n to oo we obtain that 



xt--VJi^'-). 



V./(x*) = 0. 
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Now let i E /o(x*). If there exist an infinite number of indices kn for which x,-"^^ ^ 0, 
then as in the previous case we obtain that x^"~^^ = x-" — Vj/(x'^") for these indices, implying 
(by taking the limit) that Vi/(x*) = 0. In particular, ||Vi/(x*)|| < LMs{-x*). On the other 
hand, if there exists an M > such that for all n > M Xj""*"^ = 0, then 



^'" - ;^V./(x'=" 



< Ms { x'^" - yV/(x'=") ) = M,(x'="+i^ 



Thus, taking n to infinity while exploiting the continuity of the function Mg, we obtain that 

|V./(x*)|<LM,(x*), 
establishing the desired result. □ 

4.1.2 Proof of Theorem [32] 

Let {'x^}k>o be the sequence generated by the IHT method. We begin by showing that the 
sequence is bounded. By the descent property of the sequence of function values (see Lemma 
13. ip . it follows that the sequence {x^}fc>o is contained in the level set 

T = {xGM":/li(x)</li(x°)}. 

We now show that T is bounded. To this end, note that number of subsets oi {1,2, . . . ,n} 
whose size is no larger than s is equal to 



p = Yl 



k=0 

By denoting these p subsets as Ji, I2, ■ ■ ■ , Ip, we can represent the set T as the union: 

p 



T=[JT, 



where 



T,- = {x G M" : /li(x) < /li(x°),x, = Vz ^ /,} . 



In this notation, we can rewrite Tj as 

T, = {x G R" : WAt^^t, - bf < /li(x°),x^ = o} . 

The set Tj is bounded since the s-regularity of A implies that the matrix A^ A7- is positive 
definite. This implies the boundedness of T. 

We conclude that the sequence {x'^}fc>o is bounded and therefore, in particular, there 
exists a subsequence {x'^"}„>o which converges to an accumulation point x* which is an L- 
stationary point, and hence also a BF vector. By Lemma 12. ![ the number of BF vectors is 
finite, which implies that there exists an e > smaller than the minimal distance between all 
the pairs of the BF vectors. To show the convergence of the entire sequence to x*, suppose 
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in contradiction that this is not the case. We will assume without loss of generality that the 
subsequence {x*^"}„>o satisfies Hx*-'" — x*|| < e for every n > 0. Since we assumed that the 
sequence is not convergent, the index t„ given by 

tn = max{/ : ||x* — x* || < e,i = kn,kn + I, ■ ■ ■ ,1} 

is well defined. We have thus constructed a subsequence {x*"}„>o for which 

||x*" - x*|| < e, ||x*"+^ - x*|| > e, n = 0, 1, . . . 

It follows that X*" converges to x*, and in particular there exists an A^ > such that for all 
n > N, ||x*" - x*|| < e/2. Thus, for all n > N, 

W t f -1-1 II ^ 

l|x X II > 2, 
contradicting Part 3 of Lemma [3.11 D. 

4.2 The Sparse-Simplex Methods 

4.2.1 Proof of Theorem ET3] 

By Lemma [3.21 the sequence of function values {/(x'^)} is nonincreasing and by Assumption 
[T]is also bounded below. Therefore, {f{x.^)} converges. Suppose that x* is an accumulation 
point of {x^}. Then there exists a subsequence {x'^"}„>o that converges to x*. Suppose that 
||x*||o = s. Then the convergence of {x'^"} to x* implies that there exists an A^ such that 
/^(x''") = /i(x*) for all n > N. Let i G /i(x*), j G {1,2, ... ,n} and t G M. By definition of 
the method it follows that 

fix.''") - /(x'="+^) > fix.''") - fix''" - x'^"ei + tSj) for all n>N. 

The convergence of {fix''")} implies that when taking the limit n — )■ cxd in the latter inequality, 
we obtain 

0>/(x*)-/(x*-x*e, + te,). 

That is, fix*) < fix* - x*ej + te^) for all i G /i(x*), j G {1,2, ... ,n} and t G M, meaning 
that 

fix*) < min fix* — x*ej + te,) 

for all i G /i(x*) and j G {1,2, ... ,n}, thus showing that x* is a CW-minimum. 

Suppose now that ||x*||o < s. By the convergence of {x'^"} to x*, it follows that there exists 
an A^ for which /i(x*) C /^(x^") for all n > N. Take n > A^; if i G /i(x*), then i G /i(x''"), 
which in particular implies that 

fix''") - /(x'="+i) > fix''") - fix''" + tSi) for all t G M. 

Taking n — >■ oo in the last inequality yields the desired inequality 

/(x*)<min/(x* + te,). (4.2) 

29 



Now suppose that i E /o(x*) and take n > N . If ||x'^"||o < s, then by definition of the greedy 
sparse- simplex method we have 

/(x'^") - /(x'^^+i) > /(x^") - /(x'^" + te,). (4.3) 

On the other hand, if ||x'^"||o = s, then the set /i(x^") \ /i(x*) is nonempty, and we can pick 
an index j„ G /i(x^") \ /i(x*). By definition of the greedy sparse-simplex method we have 

/(x'^") - /(x'^^+i) > /(x'^") - /(x'^" - x^:e,„ + te,) for all t E R. (4.4) 

Finally, combining (14. 3 p and (14. 4 p we arrive at the conclusion that 

/(x'^") - /(x'^^+i) > /(x^") - /(x'^" + d„ + te,), (4.5) 

where 

^ ^ r llx'^'io < s, 

I- ■^jn J" II IIO *• 

Since d„ — )■ as Ti tends to oo, it follows by taking the limit n — )■ oo in (14. 5 p the inequality 

/(x*)</(x* + te,) 
holds for all t G M, showing that also in this case x* is a CW-minimum. D 

4.2.2 Proof of Lemma [33] 

The proof of Theorem 13.31 until equation (14.20 is still valid for the partial sparse-simplex 
method, so that for any i E /i(x*) and any t G M: 

/(x*)</(x* + tei), 

which in particular means that G argmin {gi{t) = /(x* + te,)}, and thus Vi/(x*) = g'^^O) = 
0. D 

4.2.3 Proof of Theorem \3A\ 

The proof of the theorem relies on the following lemma: 

Lemma 4.1. Suppose that AssumptionlE holds and let {x*^} be the sequence generated by the 
sparse- simplex method. Then for any k for which ||x^||o < s it holds that 

/(x^) - /(x'^+i) > -i- , max (V./(x^'))2. (4.6) 

IL2\j) «=l,2,...,n 

For any k with \\^''\\o = s, the inequality 

f{^') - /(x^+i) > A{^') (4.7) 

holds true with 



A(x) = max <; —1- max (V,/(x))2,M,(x) 
ZL2{J) «e-fi{x) 



max |V,/(x)| - max |V,/(x)| - L2(/)M,(x) 
ie/o(x) ie/i{x) 

(4.8; 
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Proof. Suppose that Hx'^Hq < s. Then by the definition of the method we have for all 



i = 1,2, . . . ,n: 



/(x'^+^) < / X 



L2if) 



Vi/(x^)e, 



(4.9) 



On the other hand, for any i = 1,2, 



,n: 



f(^^'-j^^VJ{^')e^ < /(x'^)--^(V./(x'=))2 + ^^(V./(x'^))2 (LemmaES]) 



/(x1 



which combined with (14 .Qp implies that 



2L2{f) 



(VJ(x'^))^ 



/(x'^) - /(x'^+i) > 



k\\2 



2L2(/) -1,2,..., 



max (V,/(x'=)) 



establishing (I4.6p . 

Next, suppose that Hx'^Hq = s. A similar argument to the one just invoked shows that 



fi^') - /(x'^+i) > 



max {VJi^')y. 



By the definition of the greedy sparse-simplex method, it follows that 



(4.10) 



/(x'=)-/(x^+i) > /(x'=)-/(x'=-<e„,+T,2e,.) > /(x^)-/(x'^-x^^e„,-ax^^e,.), (4.11) 



'fc 'fc ^ 



where a = sgn (x^^Vj2/(x'^)). Using the local descent lemma (Lemma 12. 6p once more, we 
obtain that 



/ l-X -^mt ^ruk '^•^mi, ^ 



m-k Ik' 



< f(p^) + Wf{-Ky {-xl^e^, - axl^ei() + 



Uf) 



•^ruk^rnk '^■^nn-^il 



/(x^-) - x'V^JX^') - ax'V.fi^') + L,{f){xl '' 



rrik' 



/(x^) + M,(x'=) L2(/)M,(x'=)-|V,./(x'=)| -xly^JX^'^). 



Combining (14. lip and (I4.12p we obtain that 



max |V,/(x'=)|-L2(/)M,(x^) 
je/o(x'=) 



/(x'^) - /(x'^+i) > M,(x'= 
Finally, fOOj) and f Hl3|l along with the fact that 



^' V„J(x^) 



(4.12) 



(4.13) 



<V^J(x^) > -M,(x^) max |V./(x^ 

JG/l^x*^) 



readily imply the inequality (14. 7p . D 

We now turn to prove Theorem 13.41 Let x* be an accumulation point of the generated 

sequence. Then there exists a subsequence {x'^"}„>o converging to x*. Suppose first that 
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||x*||o = s. Then there exists an A^ > such that Ii(x.^") = /i(x*) for all n > N. Therefore, 
by (14. 7p we have 

/(x'^") - f{x.^"^^) > A(x^^") (4.14) 

for all n > N. Since {/(x'^)} is a nonincreasing and lower bounded sequence, it follows that 
the left hand side of the inequality (I4.14p tends to as n — )■ oo. Therefore, by the continuity 
of the operator A we have A{x*) < from which it follows that 

^ max (V,/(x*))2 = 0, (4.15) 



MJx* 



2L2(/) ie/i(x*) 

max |V./(x*)|- max |V./(x*)| - L2(/)M,(x*) 
je/o(x*) ie/i(x*) 



< 0. (4.16) 



By JKWf it follows that Vi/(x*) = for all i e /i(x*) and substituting this in fli:TB|) 

yields the inequality 

max |Vi/(x*)|<L2(/)M,(x*), 
je/o(x*) 

meaning that x* is an L2(/)-stationary point. 

Now suppose that ||x*||o < s. There are two cases. If there exists an infinite number of n-s 

for which ||x^"||o < s, then by Lemma [4. II for each such n 

/(x^") - f{^'-+') > — i- . max V./(x^")2, 

ZL2[J ) «=l,2,...,7i 

and therefore by taking n — )► oo along the n-s for which Hx'^" ||o < s, we obtain that V/(x*) = 0. 
If, on the other hand, there exists an integer N such that the equality ||x'^"||o = s holds for 
all n > N, then by the definition of the method we have for all n > A^ 

/(x'^") - /(x^"+i) > -i- max (V./(x'="))^ (4.17) 

and 

f{^'")-f{^'"+') > /(x'=")-/(x^"-x^^e„,,+T2e,.) (4.18 



Since T| G argmin/(x'^" — x'^ e^^ + ^6^2), then 

teR * 

/(x " — x^^e^^) — /(x " — x^^e-mj. + T^ej2) > [\/qj{-x " — x^^emj) 



2i^,s;s,<^'^<'''" -"-'=•">»'• 



which combined with (I4.18P yields 



^ inax (V./(x'=" - xt^e^jy < /(x^" - x^^e^J - /(x^"+i). (4.19) 



2L2(/) ie/o(xfc) 



32 



In addition, 



|V,,/(x^")| < |V./(x'=") - VJ{^'- - xte^Jl + \VJ{^'- - xler, 



and thus (14.191) readily implies that: 



= L2(/)M,(x'=") + |V,/(x'=" - <,e. 



max \VJ{p^-)\ < L,{f)M.,{^'") + J 2L^U)[i\^'- - <e^J - /(x'^^+i)], 
ie/o(x'=) * 

which together with f l4.17p yields that for all ?' = 1, 2, . . . , n 

iVJCx'^")! < min |l2(/)M,(x'=") + ^2L2(/)[/(xfc" - x^,e„J - /(x^^+i)], ^2L2(/)/(x'=") - /(x'^^+i) 

Since the righthand side of the latter inequality converges to as n — )■ oo, it follows that the 
desired result V/(x*) = holds. D 
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